Adding modelopt_run_config.yaml and a main function for megatron data preprocessing #341
Conversation
preprocessing Signed-off-by: Chenhan Yu <[email protected]>
Walkthrough: Adds a master-only YAML export of a sanitized transformer config during distributed checkpoint saving (filters keys, stringifies values, and writes {checkpoint_name}/modelopt_run_config.yaml), and adds a CLI main function to megatron_preprocess_data for fetching and preprocessing datasets.
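For illustration, here is a rough sketch of the export step described by the walkthrough and the diagram below. This is not the actual ModelOpt code: the helper `_parse_transformer_config`, the output filename, and the master-only guard come from this review thread, while the filtering logic shown is a placeholder assumption.

```python
import yaml


def _parse_transformer_config(config: dict) -> dict:
    # Placeholder stand-in for the real helper: drop callables, keep scalars,
    # stringify everything else so the result is always YAML-serializable.
    return {
        k: v if isinstance(v, (bool, int, float, str, type(None))) else str(v)
        for k, v in config.items()
        if not callable(v)
    }


def export_run_config(model, checkpoint_name: str, is_master: bool, modelopt_version: str) -> None:
    """Sketch of the master-only YAML export; signature and names are assumptions."""
    if not is_master:
        return
    config_dict = _parse_transformer_config(dict(model[0].config.__dict__))
    config_dict["nvidia_modelopt_version"] = modelopt_version
    with open(f"{checkpoint_name}/modelopt_run_config.yaml", "w") as f:
        yaml.dump(config_dict, f, default_flow_style=False)
```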
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant User
participant Trainer
participant SaveFn as save_sharded_modelopt_state
participant Model
participant FS as Filesystem
User->>Trainer: Trigger distributed checkpoint save
Trainer->>SaveFn: save_sharded_modelopt_state(model, checkpoint_name, ...)
SaveFn->>Model: Read model[0].config.__dict__
SaveFn->>SaveFn: _parse_transformer_config(config) [filter/drop/stringify]
alt Master process only
SaveFn->>FS: Write {checkpoint_name}/modelopt_run_config.yaml (yaml.dump, deepcopy, include nvidia_modelopt_version)
end
SaveFn-->>Trainer: Continue existing save flow
```

```mermaid
sequenceDiagram
autonumber
participant User
participant CLI as megatron_preprocess_data.py:main
participant HF as HuggingFace Datasets
participant FS as Filesystem
participant Pre as megatron_preprocess_data
User->>CLI: Invoke with args (input_path or dataset/subset/split, output_dir, json_keys, ...)
alt input_path provided
CLI->>Pre: megatron_preprocess_data(input_path_files, tokenizer, ...)
else Fetch from HF
CLI->>HF: load_dataset(dataset, subset, split(s))
HF-->>CLI: Dataset(s)
loop for each split
CLI->>CLI: Validate json_keys against dataset.features
alt keys valid
CLI->>FS: Write output_dir/{dataset}_{subset}_{split}.jsonl
CLI->>CLI: Append path to input list
else invalid
CLI->>CLI: Skip split
end
end
CLI->>Pre: megatron_preprocess_data(collected input files, tokenizer, ...)
end
Pre-->>CLI: Tokenization and binary output complete
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (3)
Lines 88-94: Appending EOD can exceed max_sequence_length. Reserve 1 slot for EOS when truncating.
Apply this diff:
```diff
-            enc_len += len(encoded)
-            if self.max_sequence_length is not None:
-                encoded = encoded[: self.max_sequence_length]
+            enc_len += len(encoded)
+            if self.max_sequence_length is not None:
+                max_len = self.max_sequence_length - (1 if self.append_eod else 0)
+                encoded = encoded[: max(0, max_len)]
-            if len(encoded) > 0 and self.append_eod:
-                encoded.append(_Encoder.tokenizer.eos_token_id)
+            if self.append_eod:
+                encoded.append(_Encoder.tokenizer.eos_token_id)
```
Lines 119-124: Pool leak and potential spawn pickling issue; use a context manager and a top-level initializer. Close the Pool deterministically and avoid a bound-method initializer.
Apply this diff:
```diff
-        fin = open(input_file_name, encoding="utf-8")
-
-        pool = multiprocessing.Pool(self.workers, initializer=encoder.initializer)
-        encoded_docs = pool.imap(encoder.encode, fin, 32)
+        with open(input_file_name, encoding="utf-8") as fin:
+            with multiprocessing.Pool(
+                self.workers,
+                initializer=_pool_init,
+                initargs=(encoder.tokenizer_name_or_path,),
+            ) as pool:
+                encoded_docs = pool.imap(encoder.encode, fin, 32)
```

Add this helper (outside this hunk) to the module:
```python
def _pool_init(tokenizer_name_or_path: str):
    _Encoder.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path)
```
Lines 144-152: KeyError risk when some outputs already exist, plus a token counter bug. Builders may be missing for some keys; the loop must honor active builder keys. Also, final_enc_len uses a stale "key" variable.
Apply this diff:
```diff
-        total_doc_len, total_enc_len, final_enc_len = 0, 0, 0
-        for i, (doc, sentence_lens, (doc_len, enc_len)) in enumerate(encoded_docs, start=1):
+        total_doc_len, total_enc_len, final_enc_len = 0, 0, 0
+        active_keys = tuple(builders.keys())
+        for i, (doc, sentence_lens, (doc_len, enc_len)) in enumerate(encoded_docs, start=1):
             total_doc_len += doc_len
             total_enc_len += enc_len
-            final_enc_len += sum(sentence_lens[key])
-            for key in doc:
-                builders[key].add_document(doc[key], sentence_lens[key])
+            final_enc_len += sum(sum(sentence_lens[k]) for k in active_keys if k in sentence_lens)
+            for k in active_keys:
+                builders[k].add_document(doc[k], sentence_lens[k])
             self._print_processing_stats(i, total_doc_len, total_enc_len)
```
🧹 Nitpick comments (2)
modelopt/torch/opt/plugins/mcore_dist_checkpointing.py (1)
Lines 25-25: Avoid a hard dependency on PyYAML at import time. Move the yaml import to the write path to prevent module import errors when this code path isn't used.
Apply this diff:
```diff
-import yaml
+# (lazy import inside save_sharded_modelopt_state to avoid hard dependency)
```

modelopt/torch/utils/plugins/megatron_preprocess_data.py (1)
Lines 108-109: Guard the modulo when log_interval is 0, to avoid a ZeroDivisionError.
Apply this diff:
```diff
-            if count % self.log_interval == 0:
+            if self.log_interval and count % self.log_interval == 0:
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- modelopt/torch/opt/plugins/mcore_dist_checkpointing.py (2 hunks)
- modelopt/torch/utils/plugins/megatron_preprocess_data.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
modelopt/torch/opt/plugins/mcore_dist_checkpointing.py (1)
modelopt/torch/utils/distributed.py (1)
- is_master (75-77)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: build-docs
- GitHub Check: code-quality
🔇 Additional comments (2)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (1)
Lines 34-41: Declare runtime deps: requests, datasets.
- File: modelopt/torch/utils/plugins/megatron_preprocess_data.py (lines 34–41) imports requests and datasets — add these to runtime dependency manifests (requirements.txt, pyproject.toml, setup.cfg install_requires) and document their inclusion.
- Verification: ran the supplied script; it returned no output so manifests could not be located in the sandbox. Manually verify dependency manifests and add entries if missing.
modelopt/torch/opt/plugins/mcore_dist_checkpointing.py (1)
Lines 143-149: Confirm dependency declaration for PyYAML. modelopt/torch/opt/plugins/mcore_dist_checkpointing.py imports yaml and writes a YAML file at runtime (lines 143-149); ensure 'pyyaml' is declared in pyproject.toml / setup.py / requirements.txt. My automated scan hit encoding errors and couldn't reliably verify manifests, so confirm and add if missing.
Codecov Report: ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##             main     #341   +/-   ##
=======================================
  Coverage   73.82%   73.82%
=======================================
  Files         172      172
  Lines       17438    17438
=======================================
  Hits        12874    12874
  Misses       4564     4564
```

☔ View full report in Codecov by Sentry.
```python
if args.input_path is None:
    args.input_path = []
if args.dataset is None:
    args.dataset = "nvidia/Nemotron-Pretraining-Dataset-sample"
```
why not set the `--dataset` default in the parser above?
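A minimal sketch of what the reviewer is asking for, assuming the parser object shown later in this thread and reusing the default value from the snippet below; the help text is an assumption:

```python
parser.add_argument(
    "--dataset",
    type=str,
    default="nvidia/Nemotron-Pretraining-Dataset-sample",  # default taken from the snippet below
    help="Hugging Face Hub dataset name",  # assumed help text
)
```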
```python
split = entry["split"]

if args.subset is not None and args.subset != subset:
    continue
```
can you replace lines 253 and 255 with `skip_process=True` to make this more clear
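A hedged sketch of one way to read this suggestion, assuming the flag is the `skip_processing` variable that appears later in this file (the reviewer writes `skip_process`); the surrounding loop structure is an assumption:

```python
# Record why a split is skipped instead of bailing out with bare `continue`s,
# so all filtering conditions read uniformly and exit through one place.
skip_processing = False
if args.subset is not None and args.subset != subset:
    skip_processing = True
if args.split is not None and args.split != split:
    skip_processing = True
if skip_processing:
    continue
```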
Signed-off-by: Chenhan Yu <[email protected]>
Actionable comments posted: 1
♻️ Duplicate comments (3)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (3)
Lines 231-231: Boolean flag parsing resolved as suggested. The switch to action="store_true" looks good and addresses earlier feedback.
Lines 247-248: Guard JSON parsing and the missing "splits" key. Accessing response.json()["splits"] assumes the happy path; guard and early-out if empty.
Apply this diff:
```diff
-    for entry in response.json()["splits"]:
+    data = response.json()
+    splits = data.get("splits", [])
+    if not splits:
+        print(f"No splits found for dataset {args.dataset}; server response keys: {list(data.keys())}")
+        return
+    for entry in splits:
```
Lines 239-246: Create output_dir before writing JSONL files. You write JSONL files before megatron_preprocess_data() creates the directory; ensure it exists here to avoid runtime failures.
Apply this diff:
```diff
     if args.input_path is None:
         args.input_path = []
+    Path(args.output_dir).mkdir(parents=True, exist_ok=True)
-    response = requests.get(
-        "https://datasets-server.huggingface.co/splits?dataset={}".format(args.dataset),
-        timeout=10,
-    )
+    # Fetch splits with robust error handling
+    try:
+        response = requests.get(
+            "https://datasets-server.huggingface.co/splits",
+            params={"dataset": args.dataset},
+            timeout=10,
+        )
+        response.raise_for_status()
+    except requests.RequestException as e:
+        print(f"Failed to fetch dataset splits for {args.dataset}: {e}")
+        return
```
🧹 Nitpick comments (6)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (6)
Lines 270-273: Sanitize filename components and use Path joins. Dataset names contain "/" and subset may be None; the current concatenation creates nested dirs and "None" in filenames.
Apply this diff:
```diff
-    json_file_path = args.output_dir + "/" + name + "_" + subset + "_" + split + ".jsonl"
-    dataset.to_json(json_file_path)
-    args.input_path += [json_file_path]
+    safe_name = name.replace("/", "__")
+    safe_subset = subset or "default"
+    json_file_path = Path(args.output_dir) / f"{safe_name}_{safe_subset}_{split}.jsonl"
+    dataset.to_json(str(json_file_path))
+    args.input_path.append(str(json_file_path))
```
Lines 119-124: Close resources deterministically (file + Pool). The open file handle and multiprocessing.Pool aren't context-managed; use context managers to avoid leaks on exceptions.
Apply this diff:
```diff
-        print("Opening", input_file_name)
-        fin = open(input_file_name, encoding="utf-8")
-
-        pool = multiprocessing.Pool(self.workers, initializer=encoder.initializer)
-        encoded_docs = pool.imap(encoder.encode, fin, 32)
+        print("Opening", input_file_name)
+        with open(input_file_name, encoding="utf-8") as fin, \
+            multiprocessing.Pool(self.workers, initializer=encoder.initializer) as pool:
+            encoded_docs = pool.imap(encoder.encode, fin, 32)
@@
-        fin.close()
-        for key in builders:
+        for key in builders:
             builders[key].finalize(output_idx_files[key])

         return final_enc_len
```

Also applies to: 153-156
Lines 192-194: mkdir should create parent dirs. Use parents=True to handle nested output paths.
Apply this diff:
```diff
-    Path(output_dir).mkdir(exist_ok=True)
+    Path(output_dir).mkdir(parents=True, exist_ok=True)
```
Lines 67-69: Avoid double-loading the tokenizer (performance/cold-start). The tokenizer is loaded once to read vocab_size and again in worker initializers. Caching helps, but startup is still duplicated. Consider one of the following (a sketch of the first option follows this item):
- Load once in parent, read vocab_size, and pass tokenizer_name_or_path plus env var TRANSFORMERS_OFFLINE=1 or local_files_only=True in workers; or
- Add an optional vocab_size arg to bypass the first load (derive from tokenizer config if available).
Also applies to: 193-195
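A sketch of the first option under stated assumptions: `_Encoder` comes from the file under review, the tokenizer path is a hypothetical example, and `local_files_only` is the standard `transformers` argument for reusing the local cache.

```python
from transformers import AutoTokenizer

tokenizer_name_or_path = "meta-llama/Llama-3.1-8B"  # hypothetical example path

# Parent process: load once and read vocab_size (the only download).
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path)
vocab_size = tokenizer.vocab_size


def _pool_init(name_or_path: str):
    # Worker processes: rebuild the tokenizer from the local cache populated by
    # the parent instead of re-downloading it in every worker.
    _Encoder.tokenizer = AutoTokenizer.from_pretrained(name_or_path, local_files_only=True)
```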
Lines 61-63: Heuristic max_document_length (8x): confirm intent. Hardcoding max_document_length = max_sequence_length * 8 is opaque. Either document the rationale or expose it as a CLI arg (e.g., --document_length_multiplier). A possible sketch follows this item.
Lines 261-266: Key validation against dataset columns can be more robust. Some datasets have nested or dict features; dataset.features may not reflect flattened column names. Prefer checking key in dataset.column_names and optionally warn with available columns.
Apply this diff:
```diff
-    for key in args.json_keys:
-        if key not in dataset.features:
+    for key in args.json_keys:
+        if key not in dataset.column_names:
             print(f"Key {key} not found in dataset features. Skipping...")
             skip_processing = True
             break
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- modelopt/torch/opt/plugins/mcore_dist_checkpointing.py (3 hunks)
- modelopt/torch/utils/plugins/megatron_preprocess_data.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- modelopt/torch/opt/plugins/mcore_dist_checkpointing.py
🧰 Additional context used
🧬 Code graph analysis (1)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (1)
modelopt/onnx/quantization/__main__.py (1)
- main (255-294)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: build-docs
- GitHub Check: code-quality
```python
parser.add_argument("--split", type=str, default="train", help="Hugging Face Hub dataset split")
parser.add_argument(
```
Fix default --split: make it None to process all splits when omitted
Current default "train" contradicts the PR description (“all subsets and splits if subset/split omitted”) and filters to train-only. Set default to None and keep the existing filter logic.
Apply this diff:
```diff
-    parser.add_argument("--split", type=str, default="train", help="Hugging Face Hub dataset split")
+    parser.add_argument(
+        "--split",
+        type=str,
+        default=None,
+        help="Hugging Face Hub dataset split (process all if omitted)",
+    )
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
parser.add_argument(
    "--split",
    type=str,
    default=None,
    help="Hugging Face Hub dataset split (process all if omitted)",
)
parser.add_argument(
```
🤖 Prompt for AI Agents
In modelopt/torch/utils/plugins/megatron_preprocess_data.py around lines
225-226, the --split argument currently defaults to "train" which incorrectly
limits processing to the train split; change the parser.add_argument for
"--split" to use default=None so that when omitted the existing logic will
process all splits/subsets, leaving the current filter logic unchanged.
Signed-off-by: Chenhan Yu <[email protected]>
Actionable comments posted: 1
♻️ Duplicate comments (4)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (4)
Lines 231-231: Nice fix: --append_eod uses store_true.
Lines 225-226: Default --split should be None to process all splits when omitted (matches the PR description).

```diff
-    parser.add_argument("--split", type=str, default="train", help="Hugging Face Hub dataset split")
+    parser.add_argument(
+        "--split",
+        type=str,
+        default=None,
+        help="Hugging Face Hub dataset split (process all if omitted)",
+    )
```
Lines 239-241: Create output_dir before writing JSONL files. Writing JSONL happens before megatron_preprocess_data() creates the dir; this can fail.

```diff
     if args.input_path is None:
         args.input_path = []
+    Path(args.output_dir).mkdir(parents=True, exist_ok=True)
```
Lines 275-277: Filename/path bug: dataset names contain "/" and subset can be None. The current string concat can create unintended subdirs or "None" in filenames and fail. Sanitize components and use Path join; also write JSONL explicitly.

```diff
-    json_file_path = args.output_dir + "/" + name + "_" + subset + "_" + split + ".jsonl"
-    dataset.to_json(json_file_path)
-    args.input_path += [json_file_path]
+    safe_name = name.replace("/", "__")
+    safe_subset = subset or "default"
+    json_file_path = Path(args.output_dir) / f"{safe_name}_{safe_subset}_{split}.jsonl"
+    dataset.to_json(str(json_file_path), lines=True)
+    args.input_path.append(str(json_file_path))
```
🧹 Nitpick comments (4)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (4)
Lines 40-41: Make requests/datasets optional: move imports inside main(). Avoid forcing library users to install optional CLI deps; lazy-import inside main().

```diff
-import requests
-from datasets import load_dataset
@@
 def main():
@@
-    parser = argparse.ArgumentParser(prog="megatron_preprocess_data")
+    # Lazy imports to keep optional CLI deps out of library import path
+    import requests
+    from datasets import load_dataset
+
+    parser = argparse.ArgumentParser(prog="megatron_preprocess_data")
```

Also applies to: 206-216
Lines 242-253: Harden JSON parsing for splits; guard the missing/empty key. Avoid a KeyError if the API shape changes and short-circuit if there are no splits.

```diff
-    for entry in response.json()["splits"]:
+    data = response.json()
+    splits = data.get("splits", [])
+    if not splits:
+        print(f"No splits found for {args.dataset}.")
+        return
+    for entry in splits:
```
Lines 263-266: Wrap load_dataset with error handling to continue on bad splits/configs.

```diff
-    print(f"Loading dataset {name} with subset {subset} and split {split}")
-    dataset = load_dataset(name, subset, split=split)
+    print(f"Loading dataset {name} with subset {subset} and split {split}")
+    try:
+        dataset = load_dataset(name, subset, split=split)
+    except Exception as e:
+        print(f"Failed to load {name} (subset={subset}, split={split}): {e}")
+        continue
```
Lines 239-289: Optional: rely on the datasets API to enumerate configs/splits and drop the raw HTTP call. Using datasets.get_dataset_config_names / get_dataset_split_names removes the requests dependency and one external endpoint.
If you want, I can propose a minimal patch replacing the HTTP call with datasets helpers.
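As a rough illustration of that alternative, the following sketch uses the public `datasets` helpers `get_dataset_config_names` and `get_dataset_split_names`; the function and parameter names here are invented for the example, not taken from the PR.

```python
from datasets import get_dataset_config_names, get_dataset_split_names, load_dataset


def iter_requested_splits(dataset_name, wanted_subset=None, wanted_split=None):
    # Enumerate configs and splits through the datasets library instead of the
    # datasets-server HTTP endpoint, removing the requests dependency.
    for subset in get_dataset_config_names(dataset_name):
        if wanted_subset is not None and wanted_subset != subset:
            continue
        for split in get_dataset_split_names(dataset_name, subset):
            if wanted_split is not None and wanted_split != split:
                continue
            yield subset, split, load_dataset(dataset_name, subset, split=split)
```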
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- modelopt/torch/utils/plugins/megatron_preprocess_data.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: code-quality
- GitHub Check: build-docs
🔇 Additional comments (1)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (1)
Lines 34-34: LGTM: the argparse import is appropriate for the new CLI.
```python
import requests
from datasets import load_dataset
```
🧩 Analysis chain
Ensure packaging lists new dependencies.
requests and datasets are new runtime deps for the CLI. Verify they’re declared (pyproject/requirements/extras).
Also applies to: 206-289
🏁 Script executed:
```bash
#!/bin/bash
# Verify packaging declares 'datasets' and 'requests'
rg -nP -C2 '^(datasets|requests)\b' -g 'pyproject.toml' -g 'requirements*.txt' -g 'setup.cfg' -g 'setup.py'
```
Length of output: 1129
Add 'requests' and 'datasets' to project packaging/runtime deps
modelopt/torch/utils/plugins/megatron_preprocess_data.py imports requests and datasets (lines 40–41; also referenced ~206–289). Search shows only 'datasets' in examples/*/requirements.txt and no top-level declaration for either. Add both to pyproject.toml (or setup.cfg / setup.py) or the CLI extras so the CLI runtime installs them.
🤖 Prompt for AI Agents
In modelopt/torch/utils/plugins/megatron_preprocess_data.py around lines 40-41
(and usages ~206-289), the module imports and uses the third-party packages
'requests' and 'datasets' but they are not declared in the top-level project
dependencies; add both 'requests' and 'datasets' to the project's
packaging/runtime dependencies (e.g., pyproject.toml [project]/dependencies or
setup.cfg/setup.py install_requires) or include them in the CLI extras so that
the CLI runtime installs them; pick appropriate minimal version constraints
consistent with examples/*/requirements.txt (or mirror examples) and run
dependency install/tests to verify imports resolve.
```python
if dist.is_master():
    run_config_name = f"{checkpoint_name}/modelopt_run_config.yaml"
    config_dict = _parse_transformer_config(copy.deepcopy(model[0].config.__dict__))
```
@ChenhanYu I'm seeing an issue with the deepcopy
these fields are set on the config during training:
https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/8bb2309034bf2eb27783293cd228d86407db1cf9/src/megatron/bridge/training/setup.py#L267C5-L295
These are bound methods on DDP class instances, which have PyTorch process groups initialized as member variables. The deepcopy here ends up deep-copying those instances themselves, causing issues like this:
```
b-pretrain/0 [rank0]: File "/opt/TensorRT-Model-Optimizer/modelopt/torch/opt/plugins/mcore_dist_checkpointing.py", line 157, in save_sharded_modelopt_state
b-pretrain/0 [rank0]: config_dict = _parse_transformer_config(copy.deepcopy(model[0].config.__dict__))
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]: y = copier(x, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]: y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]: y = copier(x, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 226, in _deepcopy_method
b-pretrain/0 [rank0]: return type(x)(x.__func__, deepcopy(x.__self__, memo))
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 162, in deepcopy
b-pretrain/0 [rank0]: y = _reconstruct(x, memo, *rv)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 259, in _reconstruct
b-pretrain/0 [rank0]: state = deepcopy(state, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]: y = copier(x, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]: y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]: y = copier(x, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 196, in _deepcopy_list
b-pretrain/0 [rank0]: append(deepcopy(a, memo))
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 162, in deepcopy
b-pretrain/0 [rank0]: y = _reconstruct(x, memo, *rv)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 259, in _reconstruct
b-pretrain/0 [rank0]: state = deepcopy(state, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]: y = copier(x, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]: y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]: y = copier(x, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]: y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 162, in deepcopy
b-pretrain/0 [rank0]: y = _reconstruct(x, memo, *rv)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 259, in _reconstruct
b-pretrain/0 [rank0]: state = deepcopy(state, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]: y = copier(x, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]: y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]: y = copier(x, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]: y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 162, in deepcopy
b-pretrain/0 [rank0]: y = _reconstruct(x, memo, *rv)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 259, in _reconstruct
b-pretrain/0 [rank0]: state = deepcopy(state, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]: y = copier(x, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]: y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]: y = copier(x, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]: y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 162, in deepcopy
b-pretrain/0 [rank0]: y = _reconstruct(x, memo, *rv)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 259, in _reconstruct
b-pretrain/0 [rank0]: state = deepcopy(state, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]: y = copier(x, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]: y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]: y = copier(x, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]: y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 162, in deepcopy
b-pretrain/0 [rank0]: y = _reconstruct(x, memo, *rv)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 259, in _reconstruct
b-pretrain/0 [rank0]: state = deepcopy(state, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
b-pretrain/0 [rank0]: y = copier(x, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
b-pretrain/0 [rank0]: y[deepcopy(key, memo)] = deepcopy(value, memo)
b-pretrain/0 [rank0]: ^^^^^^^^^^^^^^^^^^^^^
b-pretrain/0 [rank0]: File "/usr/lib/python3.12/copy.py", line 151, in deepcopy
b-pretrain/0 [rank0]: rv = reductor(4)
b-pretrain/0 [rank0]: ^^^^^^^^^^^
b-pretrain/0 [rank0]: TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object
```
In Megatron-Bridge, we have a YAML representer for this to serialize our configs to YAML, so we don't see this when serializing our overall job config (which includes the transformer config):
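For illustration, a minimal sketch of that representer approach (not the actual Megatron-Bridge code): register a PyYAML fallback that serializes otherwise non-representable objects, such as process groups or bound methods, as strings instead of raising.

```python
import yaml


def _represent_as_str(dumper: yaml.Dumper, data) -> yaml.ScalarNode:
    # Fallback for objects PyYAML has no representer for
    # (process groups, bound methods, dataclasses holding them, ...).
    return dumper.represent_str(str(data))


# A multi-representer on `object` only applies when no more specific
# representer exists, so dicts, lists, and scalars are still dumped normally.
yaml.add_multi_representer(object, _represent_as_str)
```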
I see. The deepcopy here is likely not necessary. Let me remove it.
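For reference, a sketch of the deepcopy-free call, assuming `_parse_transformer_config` already drops or stringifies non-serializable values as the walkthrough describes:

```python
# Shallow-copy the attribute dict instead of deep-copying it, so unpicklable
# members (process groups, bound DDP methods) are never copied; the parser
# then filters or stringifies them before yaml.dump.
config_dict = _parse_transformer_config(dict(model[0].config.__dict__))
```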
What does this PR do?
Type of change: new feature
Overview:
1. Adds a `modelopt_run_config.yaml` to the Megatron-LM or Megatron-Bridge checkpoint that we process. It is essentially a dump of the `TransformerConfig` to YAML, which can be handy for knowledge distillation. The NeMo2 checkpoint has a similar file `model_config.yaml` and Megatron-Bridge has `run_config.yaml`, so it is mainly useful for Megatron-LM.
2. Adds a main function for Megatron data preprocessing with `--tokenizer`, which accepts a `PreTrainedTokenizer` path. It can process datasets from the HF hub directly by using `--dataset`. If `subset` and `split` are not provided, all subsets and splits will be processed.

Usage
Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit